1. Introduction. This is a fascinating paper on an important topic: the
choice of predictor variables in large-scale linear models. A previous paper in
these pages attacked the same problem using the "LARS" algorithm (Efron,
Hastie, Johnstone and Tibshirani [3]), which is actually a family of three
algorithms with the Lasso as the middle case. There are tantalizing
similarities between the Dantzig Selector (DS) and the LARS methods, but they
are not the same and produce somewhat different models. We explore the
relationship of DS with the Lasso and LARS here.

2. Dantzig selector and the Lasso. The definition of the Dantzig selector 
(DS) in (1.7) can be re-expressed as 



(1)    $\min_{\beta} \|X^T(y - X\beta)\|_{\ell_\infty}$ subject to $\|\beta\|_{\ell_1} \le s$.

This makes it look very similar to the Lasso (Tibshirani [6]), or basis pursuit
(Chen, Donoho and Saunders [1]):

(2)    $\min_{\beta} \|y - X\beta\|_{\ell_2}^2$ subject to $\|\beta\|_{\ell_1} \le s$.

With a bound on the $\ell_1$ norm of $\beta$, Lasso minimizes the squared error
while DS minimizes the maximum absolute component of the gradient of the
squared error function. If $s$ is large so that the constraint has no effect,
then these are the same. However, for other values of $s$, they are a little
different; see Figure 1.
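
To make the contrast between (1) and (2) concrete, the following minimal sketch
evaluates, for a candidate coefficient vector, the squared error that the Lasso
minimizes, the largest absolute inner product that DS minimizes, and the
$\ell_1$ norm that both problems bound. The helper names and the toy data are
illustrative assumptions and do not come from the paper.

```python
# Illustrative sketch: helper names and toy data are assumptions.

def residual(X, y, beta):
    """Return r = y - X beta, with X given as a list of rows."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for row, yi in zip(X, y)]

def squared_error(X, y, beta):
    """Lasso criterion in (2): squared l2 norm of the residual."""
    return sum(ri * ri for ri in residual(X, y, beta))

def max_abs_inner_product(X, y, beta):
    """DS criterion in (1): l-infinity norm of X^T (y - X beta)."""
    r = residual(X, y, beta)
    p = len(X[0])
    return max(abs(sum(row[j] * ri for row, ri in zip(X, r)))
               for j in range(p))

def l1_norm(beta):
    """Quantity bounded by s in both (1) and (2)."""
    return sum(abs(b) for b in beta)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = [0.5, 1.0]

print(squared_error(X, y, beta))          # what the Lasso minimizes
print(max_abs_inner_product(X, y, beta))  # what DS minimizes
print(l1_norm(beta))                      # constrained to be at most s
```

The gradient of the squared error is $-2X^T(y - X\beta)$, so the second
quantity is half the largest absolute gradient component.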

Fig. 1. The Lasso and DS regularization paths for the diabetes data are mostly
identical. The predictors are standardized to have mean zero and unit $\ell_2$
norm; the same data were used to illustrate the LARS algorithms cited in the
text.

The least angle regression (LARS) algorithm (Efron, Hastie, Johnstone
and Tibshirani [3]) for solving the Lasso path makes them look tantalizingly
close (see also the homotopy algorithm of Osborne, Presnell and Turlach
[4]). In LARS, we start with $\beta = 0$ and identify the predictor having
maximal absolute inner product with $y$. We then increase or decrease its
coefficient (depending on the sign of the inner product), which in turn reduces
its absolute inner product with the current residual $r = y - X\beta$. We
continue until some other predictor has as large an absolute inner product with
the current residual. That predictor is then included in the model, and we move
both coefficients in the least squares equiangular direction, which keeps their
maximal inner products with the residuals the same and decreasing. This
process is continued, each time including variables into the model when their
inner products catch up with the maximal inner products. Eventually all the
inner products are zero, and the algorithm stops. If in addition we drop a
predictor out of the model as its coefficient passes through zero, then this
LARS algorithm delivers the entire solution set for the Lasso problem (2)
for $s$ running from 0 to $\infty$.
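
The first step of this procedure can be sketched in a few lines. The code
below is only an illustration under our own assumptions (the function name,
the column-list data layout and the toy data are hypothetical): it picks the
predictor with maximal absolute inner product with $y$, then moves its
coefficient just far enough for a second predictor to tie with it. Later
steps, which move several coefficients jointly, are not shown.

```python
# Sketch of the first LARS step only; names and data are assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_lars_step(cols, y):
    """cols: list of predictor columns. Returns (j, k, gamma, beta)."""
    p = len(cols)
    c = [dot(x, y) for x in cols]               # inner products with r = y
    j = max(range(p), key=lambda m: abs(c[m]))  # predictor entering first
    sign = 1.0 if c[j] > 0 else -1.0
    C = abs(c[j])
    A = dot(cols[j], cols[j])                   # rate at which |c_j| decreases
    gamma, k = C / A, None                      # C / A is the full fit on x_j
    for m in range(p):
        if m == j:
            continue
        a = sign * dot(cols[m], cols[j])        # rate of change of c_m
        # Solve C - g*A = +(c_m - g*a) and C - g*A = -(c_m - g*a) for g.
        for num, den in ((C - c[m], A - a), (C + c[m], A + a)):
            if den > 0 and 0 < num / den < gamma:
                gamma, k = num / den, m
    beta = [0.0] * p
    beta[j] = sign * gamma                      # coefficient after the step
    return j, k, gamma, beta

cols = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
y = [3.0, 1.0, 1.0]
j, k, gamma, beta = first_lars_step(cols, y)

# Residual after the step and its inner products with every predictor.
r = [yi - beta[j] * xi for yi, xi in zip(y, cols[j])]
print(j, k, gamma, [dot(x, r) for x in cols])
```

After the step, predictors `j` and `k` share the largest absolute inner
product with the residual, which is the point at which LARS would add `k` to
the model.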

Thus at any stage in the Lasso path, the predictors $x_j$ in the model all
have equal absolute inner product $|x_j^T(y - X\beta)|$ with the residuals, and
the predictors not in the model have a lower inner product. This is also
reflected in the Karush-Kuhn-Tucker conditions for the Lagrange form of (2),

(3)    $\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_{\ell_2}^2 + \lambda\|\beta\|_{\ell_1}$,

which require that

(4)    $x_j^T(y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_j)$    for $|\beta_j| > 0$,
(5)    $|x_j^T(y - X\beta)| \le \lambda$    for $\beta_j = 0$.

Table 1
Results for the Lasso and the Dantzig selector on the diabetes data with 64
variables (first 12 shown)

                          Lasso                          Dantzig selector
Variable j   $x_j^T(y - X\hat\beta)$  $\hat\beta_j$   $x_j^T(y - X\hat\beta)$  $\hat\beta_j$
     1              27.4134              0.0000              26.0046              0.0000
     2             -83.6413            -77.0062             -83.4945            -73.0993
     3              83.6413            502.8695              62.5323            543.7634
     4              83.6413            233.5998              83.4945            223.6250
     5             -41.1153              0.0000             -43.5949              0.0000
     6             -33.8190              0.0000             -37.0429              0.0000
     7             -83.6413           -164.0632             -83.4945           -155.4648
     8              51.2581              0.0000              50.5638              0.0000
     9              83.6413            463.4805              83.4945            455.3289
    10              83.6413              4.9767              83.4945              0.0000
    11              76.1206              0.0000              75.6962              0.0000
    12              83.6413             29.7423              83.4945             13.1410

The Lasso and DS solutions have the same $\ell_1$ norm $\|\hat\beta\|_{\ell_1} = 1734.79$.

The DS procedure seeks to minimize this maximal inner product directly.
How are these different? Table 1 shows an example. The data are the larger
version of the diabetes data, consisting of $n = 442$ observations and $p = 64$
predictors (main effects and interactions). The variables have been
standardized to have mean zero and variance 1. We have computed both the Lasso
and DS solutions with $\|\beta\|_{\ell_1} = 1734.79$. At this point, both the
Lasso and DS have 12 nonzero coefficients. We give information for the first 12
predictors in the table. We see that in DS there is a variable (#10) attaining
the maximum inner product that is not in the current model. This is in contrast
to the Lasso, where the variables that achieve the maximal inner product
are exactly the ones with nonzero coefficients, a consequence of the KKT
conditions (4)-(5). DS does this in order to achieve a lower maximal inner
product, here 83.49 versus 83.64 for the Lasso. On the other hand, DS gives
variable #3 the largest coefficient (actually the largest among all 64
coefficients), while its inner product with the residual is much smaller than
that of other variables. As it should, the Lasso solution achieves smaller mean
squared error than DS (2827.4 vs. 2829.4).
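
The Lasso half of this comparison, namely that conditions (4)-(5) tie the
active set to the maximal inner product, can be checked numerically. The
sketch below is not the LARS algorithm and does not use the diabetes data: as
a stand-in we solve the Lagrange form (3) by cyclic coordinate descent on
synthetic Gaussian data, and all names and settings are our own assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(z, lam):
    """Shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(cols, y, lam, sweeps=1000):
    """Minimize 0.5*||y - X beta||^2 + lam*||beta||_1 by coordinate descent."""
    p = len(cols)
    beta = [0.0] * p
    r = list(y)                                  # residual y - X beta
    norms = [dot(x, x) for x in cols]
    for _ in range(sweeps):
        for j in range(p):
            # Inner product of x_j with the residual that excludes x_j.
            z = dot(cols[j], r) + norms[j] * beta[j]
            new = soft_threshold(z, lam) / norms[j]
            if new != beta[j]:
                d = new - beta[j]
                r = [ri - d * xij for ri, xij in zip(r, cols[j])]
                beta[j] = new
    return beta, r

def kkt_violation(cols, r, beta, lam):
    """Largest departure from conditions (4)-(5)."""
    worst = 0.0
    for x, b in zip(cols, beta):
        g = dot(x, r)
        if b != 0.0:
            sign = 1.0 if b > 0 else -1.0
            worst = max(worst, abs(g - lam * sign))     # condition (4)
        else:
            worst = max(worst, max(0.0, abs(g) - lam))  # condition (5)
    return worst

rng = random.Random(0)                           # synthetic data, fixed seed
n, p = 20, 5
cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
lam = 0.5 * max(abs(dot(x, y)) for x in cols)    # half the smallest lam giving beta = 0
beta, r = lasso_cd(cols, y, lam)
print(beta)
print(kkt_violation(cols, r, beta, lam))
```

A DS solution need not pass this check: variable #10 in Table 1 attains the
maximal inner product with a zero coefficient, while variable #3 has a large
coefficient and a smaller inner product.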

Fig. 2. Coefficient profiles as a function of $\|\beta\|_{\ell_1}$ for the
Dantzig selector and Lasso. There are 64 predictors, the main effects and
interactions for the diabetes data. Both paths were truncated at one quarter
the norm of the full least squares fit, to allow us to zoom in on the earlier,
more relevant parts of the paths.



We found this surprising and somewhat counterintuitive. In reducing
RSS($\beta$) maximally per unit increase in $\|\beta\|_{\ell_1}$, the active
set for Lasso does correspond to the variables with largest gradients. We would
have also guessed that these gradients were being reduced as fast as possible,
but the DS shows this is not the case.

Figure 2 shows the entire solution paths for Lasso and DS for the diabetes 
data. We see that the DS paths are generally wilder than those of Lasso. 

How does this behavior of DS affect its accuracy in practice? We investigate
this next.

Fig. 3. RMSE curves for Lasso and DS for the simulation with $n = 25$,
$p = 100$, and a sparse coefficient vector $\beta$ with 15 nonzero entries. The
left panel uses a grid on $\|\beta\|_{\ell_1}$, while the right uses a grid on
$\|X^T(y - X\beta)\|_{\ell_\infty}$.



3. Comparison of prediction accuracy. We conducted a small simulation
study to compare the Lasso and DS. We generated data from the model

(6)    $y = X\beta + \varepsilon$,

with $X$ a matrix of $p = 100$ variables (columns) and $n = 25$ samples. Each of
the entries of $X$, as well as those of $\varepsilon$, was generated i.i.d.
from a Gaussian distribution $N(0, 1)$. The first 15 coefficients of $\beta$
were generated from a $N(0, 16)$ distribution, and the remainder were set to
zero. Hence we dubbed this the $n < p$ sparse case. The Lasso and DS
coefficient paths were computed on a grid of values for $\|\beta\|_{\ell_1}$,
and for each value we computed the root-mean-squared error between
$\hat\beta$ and the true $\beta$. This was repeated 1000 times, generating a
new $X$ and $\varepsilon$ each time, but using the same value for $\beta$. Thus
for each value of $\|\beta\|_{\ell_1}$, we have 1000 RMSE values corresponding
to each of Lasso and DS. Figure 3 (left panel) shows the average and standard
deviation for these RMSEs. Lasso is consistently below DS. The right panel
shows a similar simulation, except here we use a grid of values for
$\lambda = \|X^T(y - X\beta)\|_{\ell_\infty}$. For DS this amounts to solving
the equivalent problem to (1):

(7)    $\min_{\beta} \|\beta\|_{\ell_1}$ subject to $\|X^T(y - X\beta)\|_{\ell_\infty} \le \lambda$,

and for Lasso, solving the Lagrange form (3). Whichever way we look at
these results, Lasso outperforms DS, and achieves a lower minimum.
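
The design of this simulation can be sketched as follows. This is an
illustrative harness under stated assumptions, not the code used for the
figures: the function names are hypothetical, RMSE is taken to be the square
root of the mean squared coefficient error over all $p$ coordinates, and the
estimator passed in is a placeholder that a Lasso or DS solver evaluated at
one grid value would replace.

```python
import math
import random
import statistics

def draw_beta(rng, p=100, k=15, coef_sd=4.0):
    """First k coefficients ~ N(0, coef_sd^2), i.e. N(0, 16); the rest zero."""
    return [rng.gauss(0.0, coef_sd) if j < k else 0.0 for j in range(p)]

def draw_data(rng, beta, n=25):
    """Draw X and y = X beta + eps with i.i.d. N(0, 1) entries in X and eps."""
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y

def rmse(estimate, truth):
    """Root-mean-squared error between two coefficient vectors (assumed form)."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth))
                     / len(truth))

def replicate(rng, fit, beta, n=25, reps=1000):
    """Redraw X and eps reps times with beta held fixed; summarize the RMSEs."""
    errors = []
    for _ in range(reps):
        X, y = draw_data(rng, beta, n)
        errors.append(rmse(fit(X, y), beta))
    return statistics.mean(errors), statistics.stdev(errors)

def null_fit(X, y):
    """Placeholder estimator (all zeros); swap in a Lasso or DS fit here."""
    return [0.0] * len(X[0])

rng = random.Random(1)
beta = draw_beta(rng)                       # drawn once, reused in every replicate
mean_err, sd_err = replicate(rng, null_fit, beta, reps=20)
print(mean_err, sd_err)
```

Holding `beta` fixed while redrawing `X` and the noise mirrors the description
above; the small `reps=20` in the usage line is only to keep the example quick.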

We repeated this simulation with everything the same except $\beta$ was dense:
no coefficient was zero and each was generated i.i.d. $N(0, 1)$. Figure 4
(left panel) shows the results; here the differences are small. The right panel
shows a similar simulation for the $n > p$ case ($n = 100$, $p = 25$) and five
nonzero elements in $\beta$. Here the performance of the two procedures is
nearly identical.

Fig. 4. RMSE for Lasso and DS in the $n < p$ dense case (left panel) and
$n > p$ sparse case (right panel). In the $n < p$ dense case, Lasso slightly
outperforms DS but the differences are small. In the $n > p$ sparse case, the
performance of the procedures is not distinguishable; this is also the case for
the $n > p$ dense case (not shown).

4. Computational considerations. The DS problem (1) is a linear program
(LP) while the Lasso (2) is a quadratic program. The Lasso path is piecewise
linear, and the LARS algorithm exploits this: the computational load
for computing the entire path is equivalent to solving a single least squares
problem in the final set of variables. For $n < p$ and $\beta$ sparse, Donoho
and Tsaig [2] argue that this is the most efficient way to solve any of the
Lasso problems. DS will also have a piecewise-linear path algorithm (Rosset and
Zhu [5]), but from Figure 2 it is clear that it has many more steps, and is
unlikely to provide a similar advantage.
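
To see why (1) is a linear program, one standard device is to split $\beta$
into nonnegative parts $\beta = \beta^+ - \beta^-$ and add a scalar $t$ that
bounds every $|x_j^T(y - X\beta)|$. The sketch below only assembles the LP
data in the generic form "minimize cost'z subject to A_ub z <= b_ub, z >= 0";
it does not solve it. The function names and the variable ordering
$z = (\beta^+, \beta^-, t)$ are our own assumptions.

```python
# Sketch: build the LP data for the DS problem (1); names are assumptions.

def dantzig_lp(X, y, s):
    """Return (cost, A_ub, b_ub) for z = (beta_plus, beta_minus, t) >= 0."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    c = [sum(a * b for a, b in zip(col, y)) for col in cols]   # X^T y
    G = [[sum(a * b for a, b in zip(u, v)) for v in cols]
         for u in cols]                                        # X^T X
    cost = [0.0] * (2 * p) + [1.0]                             # minimize t
    A_ub, b_ub = [], []
    for j in range(p):
        neg = [-g for g in G[j]]
        # x_j^T (y - X beta) <= t
        A_ub.append(neg + list(G[j]) + [-1.0])
        b_ub.append(-c[j])
        # -x_j^T (y - X beta) <= t
        A_ub.append(list(G[j]) + neg + [-1.0])
        b_ub.append(c[j])
    # sum(beta_plus) + sum(beta_minus) <= s bounds the l1 norm of beta
    A_ub.append([1.0] * (2 * p) + [0.0])
    b_ub.append(s)
    return cost, A_ub, b_ub

def is_feasible(z, A_ub, b_ub, tol=1e-9):
    """Check z >= 0 and A_ub z <= b_ub up to a small tolerance."""
    if any(v < -tol for v in z):
        return False
    return all(sum(a * v for a, v in zip(row, z)) <= b + tol
               for row, b in zip(A_ub, b_ub))

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
cost, A_ub, b_ub = dantzig_lp(X, y, s=1.5)

z0 = [0.0, 0.0, 0.0, 0.0, 5.0]   # beta = 0 with t = max_j |x_j^T y| = 5
print(is_feasible(z0, A_ub, b_ub))
print(sum(ci * zi for ci, zi in zip(cost, z0)))
```

The triple `(cost, A_ub, b_ub)` can be passed to any general-purpose LP
solver. The Lasso (2), by contrast, has a quadratic objective and cannot be
written this way.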

5. Conclusions. The optimality properties of the Dantzig selector established
by the authors are impressive. We wonder if similar properties hold
for the Lasso, and hope that the authors can shed some light on this.

From our brief study, the inherent criterion in DS for including predictors 
in the model appears to be counterintuitive, and its prediction accuracy 
seems to be similar to that of the Lasso in some settings, and inferior in 
other settings. Hence we find little reason to recommend the Dantzig selector 
over the Lasso. 